A Fast Parallel Random Forest Algorithm Based on Spark
Authors
Abstract
To improve computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy and achieve higher accuracy. Next, to reduce the number of candidate split points and the associated calculations for continuous features, an approximate equal-frequency binning method is used to determine the optimal split points efficiently. Finally, based on the Apache Spark framework, a sampling index (FSI) table is defined to speed up the training process of the decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while ensuring accuracy, and that it is superior to Spark-MLRF in terms of performance and scalability.
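The binning step can be illustrated with a minimal single-machine sketch. The abstract does not give the definition of the new Gini coefficient or the exact binning rule, so the code below uses the standard Gini impurity as a stand-in and a simple sort-based equal-frequency scheme. The function names `gini`, `equal_frequency_candidates`, and `best_split` are hypothetical, and the sketch does not attempt the Spark parallelization or the FSI table.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity (a stand-in, not the paper's modified coefficient)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def equal_frequency_candidates(values, num_bins):
    """Candidate split points at approximate equal-frequency bin boundaries.

    Instead of testing every midpoint between adjacent sorted values,
    only the boundaries of num_bins roughly equal-sized bins are kept.
    """
    ordered = sorted(values)
    n = len(ordered)
    candidates = []
    for k in range(1, num_bins):
        idx = k * n // num_bins
        # Skip boundaries that fall between identical values.
        if 0 < idx < n and ordered[idx - 1] != ordered[idx]:
            point = (ordered[idx - 1] + ordered[idx]) / 2.0
            if not candidates or candidates[-1] != point:
                candidates.append(point)
    return candidates


def best_split(values, labels, num_bins):
    """Return (split_point, weighted_gini) with the lowest weighted Gini, or None."""
    n = len(values)
    best = None
    for point in equal_frequency_candidates(values, num_bins):
        left = [y for x, y in zip(values, labels) if x <= point]
        right = [y for x, y in zip(values, labels) if x > point]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (point, score)
    return best
```

With `num_bins` bins, a continuous feature contributes at most `num_bins - 1` candidate split points rather than one per distinct value, which is the source of the reduction in impurity evaluations described above.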
Similar resources
A Random Forest Classifier based on Genetic Algorithm for Cardiovascular Diseases Diagnosis (RESEARCH NOTE)
Machine learning-based classification techniques support the decision-making process in healthcare, especially in disease diagnosis, prognosis and screening. Healthcare datasets are voluminous, and their high dimensionality leads to a slower learning rate and higher computational cost. Feature selection is expected to deal with the high dimen...
CLUS: Parallel Subspace Clustering Algorithm on Spark
Subspace clustering techniques were proposed to discover hidden clusters that only exist in certain subsets of the full feature spaces. However, the time complexity of such algorithms is at most exponential with respect to the dimensionality of the dataset. In addition, datasets are generally too large to fit in a single machine under the current big data scenarios. The extremely high computati...
Diagnosis of Diabetes Using a Random Forest Algorithm
Background: Diabetes is the fourth leading cause of death in the world. And because so many people around the world have the disease, or are at risk for it, diabetes can be called the disease of the century. Diabetes has devastating effects on the health of people in the community and if diagnosed late, it can cause irreparable damage to vision, kidneys, heart, arteries and so on. Therefore, it...
An Improved Fast Compressive Tracking Algorithm Based on Online Random Forest Classifier
The fast compressive tracking (FCT) algorithm is a simple and efficient algorithm proposed in recent years. However, it has difficulty handling factors such as occlusion, appearance changes, and pose variation. There are several reasons. First, even though the naive Bayes classifier is fast to train, it is not robust to noise. Second, the parameters are requ...
MergeShuffle: A Very Fast, Parallel Random Permutation Algorithm
This article introduces MERGESHUFFLE, an extremely efficient algorithm for generating random permutations (or randomly permuting an existing array). It is easy to implement, runs in n log_2 n + O(1) time, is in-place, uses n log_2 n + Θ(n) random bits, and can be parallelized across any number of processes in a shared-memory PRAM model. Finally, our preliminary simulations usin...
Journal
Journal title: Applied Sciences
Year: 2023
ISSN: 2076-3417
DOI: https://doi.org/10.3390/app13106121